Random Forest Analysis to Identify Auxiliary Variables for Missing Data: A Tutorial

Stefany Coxe, Ph.D.

May 30, 2025

Overview

  • Tutorial extension of:

Hayes, T., Baraldi, A. N., & Coxe, S. (2024). Random forest analysis and lasso regression outperform traditional methods in identifying missing data auxiliary variables when the MAR mechanism is nonlinear (P.S. Stop using Little’s MCAR test). Behavior Research Methods, 56(8), 8608-8639. https://doi.org/10.3758/s13428-024-02494-1

Overview

  • Missing data
    • Mechanisms and patterns of missingness
    • Auxiliary variables
  • Random forest analysis (RFA)
    • Used to select auxiliary variables
  • Example using real data
    • RFA to final analysis

Missing data

  • Missing data is extremely common in prevention science
    • Self-report data
    • Longitudinal studies
    • Sensitive personal information
  • Failing to account for missing data in analysis
    • Bias in estimates and standard errors
    • Reduced power

Mechanisms

  • How missing values relate to other variables
    • MCAR: Missing completely at random
      • Missingness unrelated to any observed or unobserved variables
    • MAR: Missing at random
      • Missingness related to other observed variables in the model
    • MNAR: Missing not at random
      • Missingness related to the missing values themselves
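To make the three mechanisms concrete, here is a minimal simulation sketch in base R. The variable names (x, y), the sample size, and the missingness models are illustrative assumptions, not values from the tutorial data.

```r
set.seed(123)
n <- 10000
x <- rnorm(n)                 # fully observed variable
y <- 0.5 * x + rnorm(n)       # variable that will have missing values

# MCAR: every case has the same 30% chance of being missing
miss_mcar <- rbinom(n, 1, 0.30) == 1
# MAR: probability of missingness depends only on the observed x
miss_mar <- rbinom(n, 1, plogis(-1 + 1.5 * x)) == 1
# MNAR: probability of missingness depends on the value of y itself
miss_mnar <- rbinom(n, 1, plogis(-1 + 1.5 * y)) == 1

# Mean of y among the cases that remain observed under each mechanism
obs_means <- c(
  complete = mean(y),
  MCAR     = mean(y[!miss_mcar]),
  MAR      = mean(y[!miss_mar]),
  MNAR     = mean(y[!miss_mnar])
)
round(obs_means, 2)
```

In this setup the observed mean stays close to the complete-data mean under MCAR, while it is pulled downward under MAR and MNAR because high values are more likely to be missing.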

Patterns of missingness

  • Linear patterns
    • Missingness increases at high (or low) values of a variable
    • Larger impact on means and intercepts
  • Nonlinear patterns
    • Convex: Missingness increases at both high and low values
    • Interactive: Missingness related to combination of variables
    • Collins, Schafer, & Kam (2001); Hayes, Baraldi, & Coxe (2024)
    • Larger impact on variances and regression coefficients

Patterns of missingness

  • Traditional methods of IDing variables related to missingness are better at detecting linear patterns than nonlinear patterns (though they’re still not great; see Hayes, Baraldi, & Coxe, 2024)
    • Little’s MCAR test, \(t\)-tests, logistic regression
  • Data-driven machine learning methods are good at detecting many different kinds of patterns (i.e., linear, convex, interactive)
    • Random forest analysis, lasso regression
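The difference between linear and convex patterns can be illustrated with a small base R sketch. The data are simulated and every name and coefficient here is a hypothetical choice. A logistic regression with only a linear term is fit to a linear and to a convex missingness pattern.

```r
set.seed(2024)
n <- 5000
x <- rnorm(n)

# Linear pattern: missingness increases with x
m_linear <- rbinom(n, 1, plogis(-1 + 1.0 * x))
# Convex pattern: missingness increases at both high and low x
m_convex <- rbinom(n, 1, plogis(-2 + 1.0 * x^2))

# p-value for the linear slope in a logistic regression of the indicator on x
slope_p <- function(m, x) {
  fit <- glm(m ~ x, family = binomial)
  coef(summary(fit))["x", "Pr(>|z|)"]
}

p_linear <- slope_p(m_linear, x)   # linear pattern, linear-only model
p_convex <- slope_p(m_convex, x)   # convex pattern, linear-only model

# The convex pattern becomes visible only once a squared term is added by hand
fit_sq <- glm(m_convex ~ x + I(x^2), family = binomial)
p_convex_sq <- coef(summary(fit_sq))["I(x^2)", "Pr(>|z|)"]

signif(c(p_linear = p_linear, p_convex = p_convex, p_convex_sq = p_convex_sq), 3)
```

The linear-only model has no way to represent the convex pattern, so its slope test will usually fail to flag x unless the analyst already knows to add the squared term.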

Handling missing data

  • Two primary methods to mitigate bias due to missing values
    • Full information maximum likelihood (FIML)
    • Multiple imputation (MI)
  • Both assume a missing at random (MAR) mechanism
    • Including relevant auxiliary variables makes the MAR assumption more plausible
    • Without relevant auxiliary variables: MNAR-by-omission
  • Performance depends on selection of relevant auxiliary variables

Auxiliary variables

  • Auxiliary variables are observed variables that are related to both missingness and variables in the model
  • How to select the best auxiliary variables?
    • Too many auxiliary variables
    • Not enough auxiliary variables

Too many auxiliary variables

  • Recommendation from missing data lit: “Inclusive approach”
    • Include all variables related to missingness (Collins, Schafer, & Kam, 2001)
    • (Including variables unrelated to missingness doesn’t hurt)
  • In practice
    • Often more variables than subjects
      • Clinical psych: Dozens of demographic / baseline variables plus several 30 to 40 item scales with ~200 subjects

Not enough auxiliary variables

  • Omit relevant auxiliary variables: Bias due to missingness remains
    • MNAR-by-omission without auxiliary vs MAR with auxiliary
  • Hayes et al. (2024) showed that traditional methods of IDing auxiliary variables often miss complex missing data patterns
    • Random forest doesn’t: It detects convex and interactive patterns as well as linear ones
    • We never know what the missing data pattern is
    • Random forest analysis to identify auxiliary variables

Random forest analysis

Classification and regression trees (CART)

  • Machine learning method for classification
    • “Classification” = categorical outcome (binary in this case)
    • Here, binary outcome is “missing / not missing”
    • Missing data indicator: 1 if missing, 0 if not missing
  • Mines the data for complex relationships by repeatedly partitioning the data into maximally homogeneous subgroups on the missing data indicator
    • Split the data into groups that are mostly 0s or mostly 1s
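The idea of splitting into "mostly 0s or mostly 1s" can be sketched for a single predictor in base R. This is a simplified illustration that uses Gini impurity as the purity measure; the simulated data and the cut point of 5 are assumptions, and real CART software searches over all predictors and keeps splitting recursively.

```r
set.seed(1)
n <- 200
x1 <- runif(n, 0, 10)
# Hypothetical truth: cases with x1 > 5 are much more likely to be missing
M <- rbinom(n, 1, ifelse(x1 > 5, 0.8, 0.1))

# Gini impurity of a set of 0/1 values: 0 when the group is all 0s or all 1s
gini <- function(m) {
  p <- mean(m)
  2 * p * (1 - p)
}

# Candidate cut points: midpoints between adjacent sorted values of x1
xs <- sort(unique(x1))
cuts <- (head(xs, -1) + tail(xs, -1)) / 2

# Weighted impurity of the two groups created by each candidate cut
impurity <- sapply(cuts, function(cut) {
  left <- M[x1 <= cut]
  right <- M[x1 > cut]
  (length(left) * gini(left) + length(right) * gini(right)) / n
})

best_cut <- cuts[which.min(impurity)]
c(best_cut = best_cut,
  p_missing_below = mean(M[x1 <= best_cut]),
  p_missing_above = mean(M[x1 > best_cut]))
```

The chosen cut is the one that leaves the two groups as homogeneous as possible on the missing data indicator.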

Classification and regression trees (CART)

flowchart TB
  A((Root node)) -->|X1 <= 5| C((Internal node))
  A((Root node)) -->|X1 > 5| B((Internal node))
  B((Internal node)) -->|X2 = 0| D["Terminal node 
  P(missing) = 0.1"]
  B((Internal node)) -->|X2 = 1| E["Terminal node 
  P(missing) = 0.8"]
  B((Internal node)) -->|X2 = 2 or 3| F["Terminal node
  P(missing) = 0.2"]
  C((Internal node)) -->|X2 = 0 or 1| G["Terminal node
  P(missing) = 0.3"]
  C((Internal node)) -->|X2 = 2 or 3| H["Terminal node
  P(missing) = 0.9"]
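Read as a set of rules, the tree in the diagram is a sequence of nested if/else decisions. A minimal sketch in R follows. The function name is hypothetical; the split values and probabilities are copied from the diagram.

```r
# Predicted probability of missingness from the example tree
predict_p_missing <- function(x1, x2) {
  if (x1 > 5) {
    if (x2 == 0) {
      0.1
    } else if (x2 == 1) {
      0.8
    } else {
      0.2            # X2 = 2 or 3
    }
  } else {
    if (x2 %in% c(0, 1)) {
      0.3
    } else {
      0.9            # X2 = 2 or 3
    }
  }
}

predict_p_missing(x1 = 7, x2 = 1)   # X1 > 5, then X2 = 1: 0.8
predict_p_missing(x1 = 3, x2 = 2)   # X1 <= 5, then X2 = 2 or 3: 0.9
```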

Random forest analysis (RFA)

  • CART tends to overfit
    • Large, complex models (trees) that don’t replicate in new data
    • Leans into the idiosyncrasies of the data
  • Random forest analysis reduces overfitting and collinearity issues
    • Repeatedly conducts CART on bootstrap samples
    • Each split within a tree considers a random subset of predictors
    • Aggregate the results across all trees
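The resample, randomize, and aggregate logic can be sketched in base R with one-split trees ("stumps"). This is a deliberately stripped-down illustration, not the algorithm used by any particular package. The data are simulated and all names are hypothetical. Because each stump has only one split, drawing the candidate predictors once per tree is the same as drawing them once per split.

```r
set.seed(42)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
# Hypothetical truth: only x1 is related to missingness
dat$M <- rbinom(n, 1, ifelse(dat$x1 > 0.5, 0.8, 0.1))
predictors <- c("x1", "x2", "x3", "x4")

gini <- function(m) {
  p <- mean(m)
  2 * p * (1 - p)
}

# Best single split among the candidate variables (cuts at the deciles)
fit_stump <- function(d, vars) {
  best <- list(impurity = Inf)
  for (v in vars) {
    for (cut in quantile(d[[v]], probs = seq(0.1, 0.9, by = 0.1))) {
      left <- d$M[d[[v]] <= cut]
      right <- d$M[d[[v]] > cut]
      if (length(left) == 0 || length(right) == 0) next
      imp <- (length(left) * gini(left) + length(right) * gini(right)) / nrow(d)
      if (imp < best$impurity) {
        best <- list(impurity = imp, var = v, cut = cut,
                     p_left = mean(left), p_right = mean(right))
      }
    }
  }
  best
}

predict_stump <- function(s, newdata) {
  ifelse(newdata[[s$var]] <= s$cut, s$p_left, s$p_right)
}

n_trees <- 100
mtry <- 2
forest <- lapply(seq_len(n_trees), function(b) {
  boot <- dat[sample(n, replace = TRUE), ]       # bootstrap sample
  fit_stump(boot, sample(predictors, mtry))      # random subset of predictors
})

# Aggregate: average the predicted probabilities across all trees
pred <- rowMeans(sapply(forest, predict_stump, newdata = dat))
table(sapply(forest, function(s) s$var))         # which variable each tree split on
```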

Random forest analysis (RFA)

  • Variable importance instead of node probability
    • How much a variable contributes to prediction
    • How much a variable helps make splits “pure” (close to 0 or 1)
    • Higher values = more important
      • But no cut-offs
  • Permutation importance
    • Compare observed importance to importance of shuffled (“permuted”) predictors
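One common way to define permutation importance is the increase in prediction error after a predictor's values are shuffled. The base R sketch below applies that idea to a logistic regression as a stand-in model. That choice, the simulated data, and the error measure are assumptions for illustration; packages such as permimp compute importance from the forest itself.

```r
set.seed(7)
n <- 1000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
# Hypothetical truth: only x1 is related to missingness
dat$M <- rbinom(n, 1, plogis(-1 + 2 * dat$x1))

fit <- glm(M ~ x1 + x2 + x3, data = dat, family = binomial)

# Mean squared error of the predicted probabilities (lower is better)
mse <- function(model, data) {
  mean((data$M - predict(model, newdata = data, type = "response"))^2)
}
baseline <- mse(fit, dat)

# Importance = average increase in error after shuffling one predictor
perm_importance <- sapply(c("x1", "x2", "x3"), function(v) {
  mean(replicate(50, {
    shuffled <- dat
    shuffled[[v]] <- sample(shuffled[[v]])
    mse(fit, shuffled) - baseline
  }))
})
round(perm_importance, 4)
```

Shuffling a predictor that matters breaks its link to the outcome and makes predictions worse. Shuffling an irrelevant predictor changes little, so its importance hovers around zero and can come out slightly negative by chance.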

Example

Example data

  • DeShong, H. L., Grant, D. M., & Mullins-Sweatt, S. N. (2019). Precursors of the emotional cascade model of borderline personality disorder: The role of neuroticism, childhood emotional vulnerability, and parental invalidation. Personality Disorders: Theory, Research, and Treatment, 10, 317-329.
    • Sample 2: \(n\) = 215 participants recruited via MTurk
    • Measured at 3 time points: \(n\) = 215, 167, 157, respectively
    • Subset of data: 105 wave 1 variables, 1 wave 3 variable

Example data

glimpse(dat1)
Rows: 215
Columns: 106
$ Age         <dbl> 28, 34, 33, 51, 36, 26, 50, 32, 46, 57, 37, 26, 26, 61, 34…
$ Gender      <dbl+lbl> 1, 1, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, …
$ Ethni_1     <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Ethni_2     <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ Ethni_3     <dbl> 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1…
$ Ethni_4     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
$ Ethni_5     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ Ethni_6     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ MSI_total   <dbl> 7, 7, 1, 2, 5, 7, 3, 0, 2, 0, 9, 7, 3, 1, 1, 0, 1, 1, 4, 1…
$ FFBIAnx     <dbl> 4, 16, 8, 5, 5, 20, 10, 4, 14, 12, 19, 14, 15, 4, 4, 13, 1…
$ FFBIAng     <dbl> 11, 17, 6, 4, 15, 16, 14, 5, 18, 12, 20, 13, 4, 5, 9, 15, …
$ FFBIDesp    <dbl> 12, 17, 9, 10, 4, 14, 6, 4, 7, 12, 18, 17, 4, 5, 5, 9, 13,…
$ FFBISelf    <dbl> 6, 7, 9, 12, 6, 16, 5, 5, 13, 12, 17, 12, 4, 4, 6, 10, 8, …
$ FFBIBeh     <dbl> 7, 13, 7, 9, 13, 12, 12, 7, 15, 12, 15, 11, 10, 5, 10, 16,…
$ FFBIAff     <dbl> 4, 17, 4, 7, 8, 13, 9, 5, 14, 12, 20, 11, 10, 4, 7, 11, 10…
$ FFBIFrag    <dbl> 8, 12, 4, 4, 5, 14, 4, 4, 5, 12, 17, 10, 4, 7, 4, 7, 7, 4,…
$ FFBIDiss    <dbl> 4, 4, 4, 12, 4, 7, 4, 7, 6, 12, 12, 4, 4, 4, 4, 7, 4, 4, 4…
$ FFBITrus    <dbl> 4, 17, 10, 14, 14, 17, 14, 4, 17, 12, 14, 17, 5, 7, 8, 15,…
$ FFBIMani    <dbl> 5, 11, 8, 9, 13, 9, 13, 4, 17, 12, 19, 10, 6, 7, 6, 13, 9,…
$ FFBIOpp     <dbl> 4, 13, 7, 4, 10, 8, 8, 4, 18, 12, 16, 12, 4, 4, 7, 12, 7, …
$ FFBIRash    <dbl> 8, 19, 6, 7, 17, 8, 10, 5, 13, 12, 15, 13, 9, 5, 8, 15, 15…
$ FFBITot     <dbl> 77, 163, 82, 97, 114, 154, 109, 58, 157, 144, 202, 144, 79…
$ EPA_infr    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ EPA_virt    <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
$ Anxiety     <dbl> 4, 19, 8, 11, 8, 17, 8, 4, 12, 12, 17, 16, 16, 4, 4, 14, 1…
$ Anger       <dbl> 4, 20, 4, 10, 13, 16, 9, 4, 16, 12, 19, 14, 6, 6, 11, 12, …
$ Depress     <dbl> 4, 14, 5, 11, 4, 16, 4, 4, 8, 12, 12, 14, 5, 4, 4, 14, 16,…
$ Selfcons    <dbl> 4, 9, 11, 15, 9, 14, 5, 8, 6, 12, 11, 10, 14, 5, 4, 11, 15…
$ Immod       <dbl> 5, 11, 15, 12, 16, 12, 12, 7, 16, 12, 18, 10, 8, 5, 8, 16,…
$ Vulner      <dbl> 4, 12, 5, 9, 5, 13, 7, 5, 11, 12, 20, 16, 7, 4, 5, 14, 12,…
$ Friendl     <dbl> 20, 12, 10, 10, 13, 9, 18, 19, 13, 12, 10, 8, 20, 15, 19, …
$ Gregar      <dbl> 16, 8, 7, 8, 5, 4, 17, 19, 11, 12, 4, 9, 9, 4, 10, 8, 4, 8…
$ Assert      <dbl> 14, 13, 10, 8, 17, 14, 20, 11, 17, 12, 4, 12, 18, 10, 16, …
$ ActLevel    <dbl> 13, 10, 12, 9, 13, 17, 19, 11, 17, 12, 15, 15, 14, 14, 16,…
$ ExciSeek    <dbl> 18, 8, 14, 6, 20, 16, 18, 13, 15, 12, 16, 12, 16, 7, 14, 1…
$ Cheer       <dbl> 20, 13, 16, 11, 19, 13, NA, 15, 15, 12, 18, 8, 17, 16, 16,…
$ Imagin      <dbl> 19, 11, 15, 6, 8, 19, 19, 19, 13, 12, 20, 20, 17, 10, 16, …
$ ArtInt      <dbl> 16, 12, 18, 16, 6, 20, 20, 20, 16, 12, 18, 20, 19, 16, 17,…
$ Emot        <dbl> 11, 15, 9, 14, 10, 20, 14, 9, 15, 12, 20, 16, 19, 5, 12, 1…
$ Adven       <dbl> 12, 6, 18, 8, 8, 18, 15, 16, 12, 12, 9, 15, 9, 10, 14, 8, …
$ Intell      <dbl> 20, 15, 19, 10, 7, 20, 20, 20, 16, 12, 13, 20, 17, 7, 16, …
$ Liberal     <dbl> 12, 12, 18, 16, 17, 17, 14, 18, 14, 12, 15, 16, 16, 12, 14…
$ Trust       <dbl> 18, 7, 16, 15, 10, 10, 9, 19, 7, 12, 15, 5, 15, 16, 13, 9,…
$ Morality    <dbl> 17, 17, 16, 17, 7, 18, 13, 18, 8, 12, 13, 13, 12, 19, 17, …
$ Altruism    <dbl> 20, 12, 18, 14, 15, 19, 18, 18, 10, 12, 16, 9, 20, 17, 17,…
$ Coop        <dbl> 20, 10, 18, 20, 7, 18, 12, 20, 7, 12, 5, 9, 16, 20, 10, 8,…
$ Modesty     <dbl> 9, 17, 17, 17, 6, 18, 6, 16, 8, 12, 19, 16, 11, 12, 15, 9,…
$ Sympathy    <dbl> 14, 9, 19, 16, 12, 17, 17, 14, 8, 12, 18, 8, 19, 16, 18, 7…
$ SelfEff     <dbl> 20, 20, 14, 14, 17, 17, 19, 17, 19, 12, 10, 13, 20, 19, 17…
$ Order       <dbl> 18, 19, 15, 9, 9, 17, 11, 11, 16, 12, 4, 13, 5, 18, 18, 17…
$ Dutiful     <dbl> 19, 15, 18, 19, 13, 19, 18, 20, 16, 12, 11, 17, 19, 20, 20…
$ AchStriv    <dbl> 18, 14, 12, 11, 15, 17, 20, 14, 17, 12, 9, 15, 20, 18, 18,…
$ SelfDisc    <dbl> 16, 15, 10, 10, 6, 18, 17, 15, 18, 12, 5, 16, 12, 19, 18, …
$ Cautiou     <dbl> 14, 8, 16, 18, 6, 17, 16, 18, 16, 12, 7, 12, 10, 19, 16, 8…
$ Neuro       <dbl> 25, 85, 48, 68, 55, 88, 45, 32, 69, 72, 97, 80, 56, 28, 36…
$ Extraver    <dbl> 101, 64, 69, 52, 87, 73, NA, 88, 88, 72, 67, 64, 94, 66, 9…
$ Openness    <dbl> 90, 71, 97, 70, 56, 114, 102, 102, 86, 72, 95, 107, 97, 60…
$ Agree       <dbl> 98, 72, 104, 99, 57, 100, 75, 105, 48, 72, 86, 60, 93, 100…
$ Consci      <dbl> 105, 91, 85, 81, 66, 105, 101, 95, 102, 72, 46, 86, 86, 11…
$ ARS_Tot     <dbl> 19, 49, 22, 34, 38, 47, 40, 21, 50, 33, 48, 49, 31, 24, 32…
$ CERQ_Tot    <dbl> 18, 36, 19, 21, 15, 32, 23, 15, 8, 8, 24, 24, 18, 12, 20, …
$ RIO_Tot     <dbl> 6, 29, 10, 6, 6, 8, 15, 6, 12, 18, 25, 23, 6, 6, 12, 17, 6…
$ RRS_Tot     <dbl> 44, 58, 34, 42, 44, 59, 39, 37, 39, 22, 60, 42, 61, 24, 52…
$ RSS_Tot     <dbl> 17, 56, 15, 20, 33, 58, 20, 21, 30, 39, 33, 48, 16, 14, 21…
$ MBSSub      <dbl> 6, 6, 7, 7, 4, 6, 4, 5, 10, 4, 5, 16, 4, 4, 4, 5, 4, 5, 4,…
$ MBSSex      <dbl> 4, 8, 4, 4, 5, 6, 4, 6, 4, 4, 4, 4, 4, 4, 4, 7, 4, 4, 4, 4…
$ MBSHarm     <dbl> 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3…
$ MBSEat      <dbl> 5, 5, 6, 5, 6, 7, 5, 5, 8, 5, 5, 7, 6, 5, 5, 6, 5, 7, 5, 6…
$ MBSSteal    <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
$ MBSDrive    <dbl> 5, 3, 3, 3, 7, 3, 4, 3, 5, 3, 5, 3, 3, 3, 5, 5, 3, 5, 6, 3…
$ MBSMoney    <dbl> 4, 3, 4, 9, 3, 3, 4, 4, 3, 3, 4, 4, 7, 3, 3, 3, 3, 3, 4, 4…
$ MBSPhyAg    <dbl> 4, 7, 4, 4, 6, 6, 4, 4, 11, 4, 7, 5, 4, 4, 5, 6, 5, 5, 6, …
$ MBSReas     <dbl> 4, 7, 4, 6, 4, 4, 4, 4, 4, 4, 12, 4, 11, 4, 4, 6, 4, 6, 5,…
$ MBS_Tot     <dbl> 38, 45, 38, 44, 41, 42, 35, 37, 52, 33, 49, 49, 45, 33, 36…
$ NegUrg      <dbl> 11, 4, 11, 12, 10, 14, 14, 13, 10, 12, 5, 9, 10, 16, 14, 8…
$ LackPers    <dbl> 13, 13, 13, 11, 11, 16, 13, 11, 15, 8, 9, 16, 15, 12, 14, …
$ LackPrem    <dbl> 13, 8, 13, 13, 7, 13, 12, 16, 12, 8, 12, 13, 14, 12, 15, 1…
$ SensSeek    <dbl> 6, 13, 14, 15, 4, 9, 8, 5, 6, 12, 15, 7, 12, 11, 8, 13, 14…
$ PosUrg      <dbl> 12, 13, 14, 15, 10, 14, 14, 15, 12, 12, 12, 11, 11, 16, 14…
$ DERSnacc    <dbl> 6, 24, 7, 9, 8, 26, 14, 6, 8, 12, 17, 14, 8, 7, 9, 17, 12,…
$ DERSgoal    <dbl> 5, 22, 12, 16, 11, 10, 14, 9, 10, 12, 25, 17, 12, 5, 10, 1…
$ DERSimp     <dbl> 8, 18, 6, 7, 11, 16, 12, 7, 14, 14, 26, 16, 7, 6, 7, 11, 1…
$ DERSawar    <dbl> 11, 10, 10, 16, 23, 10, 8, 14, 12, 24, 25, 14, 10, 14, 12,…
$ DERSstra    <dbl> 8, 33, 12, 10, 11, 27, 21, 8, 14, 18, 28, 33, 11, 9, 10, 2…
$ DERSclar    <dbl> 9, 13, 9, 13, 7, 7, 10, 7, 7, 14, 21, 12, 5, 6, 10, 7, 5, …
$ DERSTot     <dbl> 47, 120, 56, 71, 71, 96, 79, 51, 65, 94, 142, 106, 53, 47,…
$ emovul      <dbl> 108, 79, 50, 56, 53, 88, 94, 56, 81, 21, 113, 67, 79, 46, …
$ SESvmom     <dbl> 31, 36, 65, 53, 63, 84, 46, 111, 36, 27, 67, 119, 55, 74, …
$ invalmom    <dbl> 75, 105, 43, 54, 29, 30, 84, 27, 90, 24, 35, 26, 68, 52, 2…
$ SESvdad     <dbl> 72, 25, 36, 31, 82, 68, 107, 86, 76, 18, 26, 45, 73, 61, 9…
$ invaldad    <dbl> 60, 93, 59, 64, 20, 49, 29, 43, 56, 15, 78, 89, 48, 39, 42…
$ ZungAnxiety <dbl> 20, 38, 31, 28, 27, 42, 46, 21, 36, 55, 41, 43, 28, 24, 24…
$ PDQ_PAR     <dbl> 1, 7, 1, 3, 3, 5, 3, 1, 6, 0, 3, 7, 1, 0, 0, 2, 5, 1, 4, 0…
$ PDQ_SZD     <dbl> 2, 6, 1, 4, 5, 4, 1, 1, 6, 0, 2, 5, 0, 3, 1, 2, 3, 2, 5, 1…
$ PDQ_SZT     <dbl> 3, 5, 2, 2, 4, 4, 2, 0, 5, 0, 3, 5, 0, 2, 0, 6, 3, 1, 3, 0…
$ PDQ_HIS     <dbl> 3, 3, 1, 2, 5, 1, 2, 0, 5, 0, 3, 1, 3, 0, 0, 0, 1, 3, 5, 0…
$ PDQ_NAR     <dbl> 3, 3, 1, 1, 7, 2, 3, 0, 7, 0, 2, 1, 3, 0, 0, 4, 4, 4, 5, 2…
$ PDQ_BOR     <dbl> 2, 8, 1, 2, 4, 6, 4, 3, 6, 1, 7, 5, 4, 2, 1, 4, 4, 1, 5, 2…
$ PDQ_AS      <dbl> 2, 5, 1, 1, 7, 2, 1, 2, 5, 1, 3, 5, 2, 3, 2, 2, 1, 1, 1, 2…
$ PDQ_AVD     <dbl> 0, 2, 3, 4, 0, 7, 1, 0, 2, 0, 6, 4, 2, 0, 0, 3, 7, 0, 7, 0…
$ PDQ_DEP     <dbl> 0, 4, 0, 4, 1, 2, 0, 0, 0, 0, 6, 0, 1, 0, 0, 1, 1, 2, 1, 0…
$ PDQ_OC      <dbl> 2, 5, 3, 2, 4, 4, 3, 0, 5, 0, 4, 4, 2, 1, 2, 3, 5, 4, 6, 2…
$ PDQ_PAG     <dbl> 0, 5, 2, 2, 3, 3, 2, 0, 3, 0, 5, 5, 2, 0, 0, 4, 3, 0, 6, 0…
$ PDQ_DEPR    <dbl> 0, 5, 2, 3, 1, 4, 2, 1, 3, 0, 4, 5, 2, 1, 0, 5, 5, 3, 7, 1…
$ PDQ_TOT     <dbl> 21, 51, 15, 27, 45, 38, 22, 7, 52, 2, 42, 38, 21, 11, 6, 2…
$ CERQ_3      <dbl> 13, NA, 16, 32, 8, 32, 11, 14, 23, NA, NA, 23, NA, 11, NA,…

Analysis plan

  • Prospective regression
    • Predictor at wave 1: Neuroticism (IPIP-NEO-120)
      • Complete data\(^*\)
    • Outcome at wave 3: Cognitive emotion regulation (CERQ)
      • 155 out of 215 participants have data
      • 27.9% missing values

Analysis plan

  • Demonstrate analyses in R lavaan
    • Listwise deletion
    • Analysis using full information maximum likelihood (FIML) estimation with no auxiliary variables
    • Analysis using three different plausible methods for selecting auxiliary variables using random forest analysis

Step 1: Indicator variable

  • Create indicator variable for missingness
    • 1 if variable is missing
    • 0 if variable is present
dat1$M <- ifelse(is.na(dat1$CERQ_3), 
                 1, 
                 0)
  • Predict M to find auxiliary variables
# A tibble: 215 × 2
   CERQ_3     M
    <dbl> <dbl>
 1     13     0
 2     NA     1
 3     16     0
 4     32     0
 5      8     0
 6     32     0
 7     11     0
 8     14     0
 9     23     0
10     NA     1
# ℹ 205 more rows
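A quick check on the indicator is to tabulate it and compute the proportion missing. The sketch below uses only the ten CERQ_3 values printed above as a toy vector, and the object name cerq_3 is hypothetical. Applied to the full data, the same calculation gives the proportion reported in the analysis plan (60 of 215, or 27.9%).

```r
# First ten values of CERQ_3, as printed above
cerq_3 <- c(13, NA, 16, 32, 8, 32, 11, 14, 23, NA)

# Same coding as ifelse(is.na(x), 1, 0), stored as integers
M <- as.integer(is.na(cerq_3))

table(M)     # counts of present (0) and missing (1)
mean(M)      # proportion missing: 0.2 in this toy vector
```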

Step 2: Random forest analysis

  • Use RFA to predict the missingness indicator
    • cforest() function from the party package
rfFit <- party::cforest(M ~ . - CERQ_3, 
                 data = dat1, 
                 control = party::cforest_unbiased(mtry = round(sqrt(105))))
  • M is predicted by what’s to the right of ~
    • Use all variables (.) except the variable with missing values (- CERQ_3)
  • mtry = round(sqrt(105)): Each split considers \(\sqrt{105} \approx 10\) randomly selected variables as candidate predictors of M
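The value of mtry can be computed from the number of predictors rather than typed by hand. A small sketch follows; the toy data frame and its column names are hypothetical, and only the count of 105 comes from the example data.

```r
# Square-root rule applied to the 105 wave 1 predictors
n_predictors <- 105
mtry <- round(sqrt(n_predictors))
mtry                                   # 10

# Counting predictors programmatically, shown on a toy data frame
toy <- data.frame(a = 1:3, b = 4:6, c = 7:9, CERQ_3 = c(1, NA, 3))
toy$M <- ifelse(is.na(toy$CERQ_3), 1, 0)
predictors <- setdiff(names(toy), c("M", "CERQ_3"))
length(predictors)                     # 3 in this toy example
round(sqrt(length(predictors)))        # mtry for the toy example
```

Counting from the data avoids a mismatch between the formula and mtry if variables are later added or dropped.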

Step 2: Random forest analysis

  • This produces no output on its own
    • How do we select variables based on the analysis?
  • Three plausible approaches
    • Boruta
    • Highest permutation importance
    • “Red line”

Step 2a: Boruta

  • Boruta() function from the Boruta package
  • General “feature selection” algorithm
    • Use for any model that creates “variable importance” values
      • RFA is the default
    • Compare observed importance to importance of shuffled (“permuted”) predictors
      • Permutation importance
    • Iterative method
      • Finds highest importance variable, then looks for next highest

Step 2a: Boruta

library(Boruta)
borutaaux <- Boruta(M ~ . - CERQ_3, 
                    data = dat1, 
                    pValue = .05)
borutaaux
Boruta performed 99 iterations in 5.257454 secs.
 2 attributes confirmed important: Age, EPA_infr;
 102 attributes confirmed unimportant: AchStriv, ActLevel, Adven,
Agree, Altruism and 97 more;
 1 tentative attributes left: DERSimp;
  • Three auxiliary variables
    • Age, EPA_infr, DERSimp
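Rather than copying names from the printed summary, the selected variables can be pulled out of the fitted object with getSelectedAttributes() from the Boruta package. The sketch below runs on simulated toy data, so all object names are hypothetical and the selected variables will not match the tutorial. It is wrapped in a check so that it only runs when Boruta is installed.

```r
aux_vars <- character(0)

if (requireNamespace("Boruta", quietly = TRUE)) {
  set.seed(1)
  n <- 200
  toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
  # Hypothetical truth: only x1 is related to missingness
  toy$M <- rbinom(n, 1, plogis(-1 + 2 * toy$x1))

  toy_fit <- Boruta::Boruta(M ~ ., data = toy, pValue = .05)

  # withTentative = TRUE keeps "tentative" variables as well as "confirmed" ones
  aux_vars <- Boruta::getSelectedAttributes(toy_fit, withTentative = TRUE)
}

aux_vars
```

The resulting character vector can be passed directly as the list of auxiliary variables in the final analysis.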

Step 2b: Highest permutation importance

  • permimp() function from the permimp package
  • Compare observed importance to importance of shuffled (“permuted”) predictors
    • Permutation importance
  • Currently no cut-offs or tests
    • Here, select highest importance variable

Step 2b: Highest permutation importance

library(permimp)
impvalues <- permimp(rfFit)$values
which.max(impvalues)
Coop 
  46 
  • One auxiliary variable
    • Coop

Step 2c: Red line

  • Find the lowest permutation importance value
    • It will be negative
  • Select all variables with permutation importance values greater than the absolute value of that lowest value
  • Tends to “overselect” variables
    • But that’s ok here, we can be inclusive

Step 2c: Red line

library(permimp)
impvalues <- permimp(rfFit)$values
which.min(impvalues)
DERSgoal 
      81 
impvalues[81]
     DERSgoal 
-0.0006311661 
  • This is the lowest permutation importance value

Step 2c: Red line

library(dplyr)
imp_data <- as.data.frame(impvalues)
# Keep variables whose importance exceeds |lowest importance value|
imp_data %>% filter(impvalues > 0.0006311661)
            impvalues
Age      0.0006505083
FFBISelf 0.0011318507
FFBIDiss 0.0007278615
Coop     0.0032977984
AchStriv 0.0006469499
NegUrg   0.0011882498
LackPrem 0.0010355826
DERSclar 0.0011281932
PDQ_AS   0.0006557183

  • Nine auxiliary variables
    • Age, FFBISelf, FFBIDiss, Coop, AchStriv, NegUrg, LackPrem, DERSclar, PDQ_AS
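Both permutation-importance rules can be written without hard-coding the threshold. The sketch below uses a toy named vector built from the ten importance values printed above. The two extra entries (other1, other2) are hypothetical fillers standing in for the remaining variables.

```r
impvalues <- c(
  Age = 0.0006505083, FFBISelf = 0.0011318507, FFBIDiss = 0.0007278615,
  Coop = 0.0032977984, AchStriv = 0.0006469499, NegUrg = 0.0011882498,
  LackPrem = 0.0010355826, DERSclar = 0.0011281932, PDQ_AS = 0.0006557183,
  DERSgoal = -0.0006311661,
  other1 = 0.0002, other2 = -0.0001     # hypothetical fillers
)

# Highest permutation importance: a single auxiliary variable
highest <- names(which.max(impvalues))

# Red line: absolute value of the lowest (most negative) importance
red_line <- abs(min(impvalues))
redline_vars <- names(impvalues)[impvalues > red_line]

highest
redline_vars
```

Computing the threshold from the vector itself means the selection updates automatically if the forest is refit and the importance values change.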

Step 3: Analysis

  • sem() function from the lavaan package
    • Could use any similar SEM package (e.g., Mplus)
  • Prospective regression
    • Neuroticism at W1 predicts cognitive emotion regulation at W3
  • Incorporate auxiliary variables using full information maximum likelihood (FIML) and saturated correlates model
    • Alternative option: Extra DVs model
    • Alternative option: Multiple imputation (MI) instead of FIML

Step 3a: Listwise deletion

library(lavaan)
listwise <- sem("CERQ_3 ~ Neuro", 
                data = dat1)
summary(listwise, rsquare = TRUE)
lavaan 0.6-19 ended normally after 1 iteration

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         2

                                                  Used       Total
  Number of observations                           149         215

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  CERQ_3 ~                                            
    Neuro             0.159    0.022    7.315    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .CERQ_3           35.353    4.096    8.631    0.000

R-Square:
                   Estimate
    CERQ_3            0.264

Step 3b: FIML with no auxiliary variables (0)

library(lavaan)
fimlnoaux <- sem("CERQ_3 ~ Neuro", 
                 data = dat1, 
                 missing = "fiml", 
                 fixed.x = FALSE)
summary(fimlnoaux, rsquare = TRUE)
lavaan 0.6-19 ended normally after 11 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         5

                                                  Used       Total
  Number of observations                           213         215
  Number of missing patterns                         3            

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Observed
  Observed information based on                Hessian

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  CERQ_3 ~                                            
    Neuro             0.157    0.022    7.309    0.000

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
   .CERQ_3            8.539    1.379    6.192    0.000
    Neuro            61.513    1.534   40.104    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .CERQ_3           34.867    3.982    8.757    0.000
    Neuro           490.519   48.047   10.209    0.000

R-Square:
                   Estimate
    CERQ_3            0.258

Saturated correlates model

Figure 5 from Graham (2003)

Saturated correlates model

  • sem.auxiliary() function from the semTools package
    • Wrapper function for sem() from lavaan
    • Set up the same way as sem()
    • But add an aux argument to supply auxiliary variables
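To show what the saturated correlates model adds, the sketch below builds the corresponding lavaan model syntax by hand. Each auxiliary variable covaries with the outcome's residual, with each predictor, and with every other auxiliary variable. The helper names are hypothetical. This illustrates the model structure and does not describe how sem.auxiliary() works internally.

```r
saturated_correlates_syntax <- function(outcome, predictors, aux) {
  # Substantive regression
  lines <- paste(outcome, "~", paste(predictors, collapse = " + "))
  # Each auxiliary variable with the outcome (residual) and the predictors
  for (a in aux) {
    lines <- c(lines,
               paste(a, "~~", paste(c(outcome, predictors), collapse = " + ")))
  }
  # Auxiliary variables with each other
  if (length(aux) > 1) {
    pairs <- combn(aux, 2)
    lines <- c(lines, paste(pairs[1, ], "~~", pairs[2, ]))
  }
  paste(lines, collapse = "\n")
}

syntax <- saturated_correlates_syntax("CERQ_3", "Neuro",
                                      c("Age", "EPA_infr", "DERSimp"))
cat(syntax, "\n")

# A saturated model for p variables has p means plus p(p + 1)/2
# variances and covariances, i.e. p(p + 3)/2 free parameters
n_params <- function(p) p * (p + 3) / 2
sapply(c(aux0 = 2, aux1 = 3, aux3 = 5, aux9 = 11), n_params)
```

With two model variables plus 0, 1, 3, or 9 auxiliary variables, the parameter counts are 5, 9, 20, and 77, matching the "Number of model parameters" lines in the lavaan output for the corresponding FIML analyses.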

Step 3c: Highest importance (1)

library(lavaan)
library(semTools)
highest_analysis_sat <- sem.auxiliary("CERQ_3 ~ Neuro", 
                                      data = dat1, 
                                      aux = c("Coop"),
                                      missing = "fiml")
summary(highest_analysis_sat, rsquare = TRUE)
lavaan 0.6-19 ended normally after 46 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         9

  Number of observations                           215
  Number of missing patterns                         4

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Observed
  Observed information based on                Hessian

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  CERQ_3 ~                                            
    Neuro             0.156    0.022    7.204    0.000

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)
  Coop ~~                                             
   .CERQ_3            0.767    1.735    0.442    0.658
    Neuro           -32.336    6.053   -5.342    0.000

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
   .CERQ_3            8.590    1.384    6.206    0.000
    Neuro            61.600    1.531   40.241    0.000
    Coop             16.805    0.250   67.280    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .CERQ_3           35.009    4.000    8.752    0.000
    Neuro           490.547   48.015   10.217    0.000
    Coop             13.413    1.294   10.368    0.000

R-Square:
                   Estimate
    CERQ_3            0.253

Step 3d: Boruta (3)

library(lavaan)
library(semTools)
boruta_analysis_sat <- sem.auxiliary("CERQ_3 ~ Neuro", 
                                     data = dat1, 
                                     aux = c("Age", "EPA_infr", "DERSimp"),
                                     missing = "fiml")
summary(boruta_analysis_sat, rsquare = TRUE)
lavaan 0.6-19 ended normally after 119 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        20

  Number of observations                           215
  Number of missing patterns                         8

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Observed
  Observed information based on                Hessian

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  CERQ_3 ~                                            
    Neuro             0.165    0.021    7.705    0.000

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)
  Age ~~                                              
    EPA_infr         -1.068    0.419   -2.548    0.011
    DERSimp         -11.066    3.544   -3.122    0.002
  EPA_infr ~~                                         
    DERSimp           0.819    0.190    4.305    0.000
  Age ~~                                              
   .CERQ_3          -10.069    5.104   -1.973    0.049
  EPA_infr ~~                                         
   .CERQ_3            0.344    0.350    0.983    0.326
  DERSimp ~~                                          
   .CERQ_3            7.062    2.056    3.434    0.001
  Age ~~                                              
    Neuro           -36.600   16.758   -2.184    0.029
  EPA_infr ~~                                         
    Neuro             1.175    0.856    1.372    0.170
  DERSimp ~~                                          
    Neuro            56.651    8.241    6.875    0.000

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
   .CERQ_3            8.244    1.373    6.003    0.000
    Neuro            61.672    1.536   40.141    0.000
    Age              35.312    0.738   47.821    0.000
    EPA_infr          0.220    0.038    5.783    0.000
    DERSimp          10.009    0.320   31.278    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .CERQ_3           35.295    4.076    8.658    0.000
    Neuro           495.255   48.790   10.151    0.000
    Age             117.228   11.307   10.368    0.000
    EPA_infr          0.305    0.030   10.236    0.000
    DERSimp          21.889    2.128   10.286    0.000

R-Square:
                   Estimate
    CERQ_3            0.277

Step 3e: Red line (9)

library(lavaan)
library(semTools)
redline_analysis_sat <- sem.auxiliary("CERQ_3 ~ Neuro", 
                                      data = dat1, 
                                      aux = c("Age", "FFBISelf", "FFBIDiss",
                                              "Coop", "AchStriv", "NegUrg",
                                              "LackPrem", "DERSclar", "PDQ_AS"),
                                      missing = "fiml")
summary(redline_analysis_sat, rsquare = TRUE)
lavaan 0.6-19 ended normally after 482 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        77

  Number of observations                           215
  Number of missing patterns                        10

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Observed
  Observed information based on                Hessian

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  CERQ_3 ~                                            
    Neuro             0.168    0.021    7.850    0.000

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)
  Age ~~                                              
    FFBISelf        -10.421    3.245   -3.211    0.001
    FFBIDiss         -9.336    2.909   -3.210    0.001
    Coop              3.475    2.715    1.280    0.201
    AchStriv          1.855    2.608    0.711    0.477
    NegUrg            6.088    2.325    2.619    0.009
    LackPrem          2.067    1.756    1.178    0.239
    DERSclar         -9.475    3.102   -3.054    0.002
    PDQ_AS           -2.391    1.134   -2.108    0.035
  FFBISelf ~~                                         
    FFBIDiss          9.864    1.311    7.524    0.000
    Coop             -5.960    1.147   -5.197    0.000
    AchStriv         -6.295    1.119   -5.624    0.000
    NegUrg           -7.039    1.024   -6.870    0.000
    LackPrem         -2.034    0.711   -2.862    0.004
    DERSclar         10.180    1.392    7.315    0.000
    PDQ_AS            2.719    0.484    5.615    0.000
  FFBIDiss ~~                                         
    Coop             -4.407    1.006   -4.381    0.000
    AchStriv         -4.738    0.980   -4.834    0.000
    NegUrg           -4.713    0.871   -5.413    0.000
    LackPrem         -1.211    0.627   -1.933    0.053
    DERSclar          8.594    1.226    7.009    0.000
    PDQ_AS            2.107    0.427    4.937    0.000
  Coop ~~                                             
    AchStriv          2.580    0.902    2.862    0.004
    NegUrg            5.804    0.867    6.690    0.000
    LackPrem          2.700    0.620    4.356    0.000
    DERSclar         -5.651    1.096   -5.155    0.000
    PDQ_AS           -3.088    0.437   -7.071    0.000
  AchStriv ~~                                         
    NegUrg            3.363    0.780    4.310    0.000
    LackPrem          2.459    0.596    4.124    0.000
    DERSclar         -6.150    1.075   -5.719    0.000
    PDQ_AS           -0.990    0.375   -2.639    0.008
  NegUrg ~~                                           
    LackPrem          3.102    0.546    5.681    0.000
    DERSclar         -5.807    0.955   -6.083    0.000
    PDQ_AS           -2.224    0.356   -6.256    0.000
  LackPrem ~~                                         
    DERSclar         -2.429    0.684   -3.549    0.000
    PDQ_AS           -1.170    0.259   -4.525    0.000
  DERSclar ~~                                         
    PDQ_AS            2.107    0.451    4.676    0.000
  Age ~~                                              
   .CERQ_3           -9.997    5.079   -1.968    0.049
  FFBISelf ~~                                         
   .CERQ_3            4.376    1.509    2.900    0.004
  FFBIDiss ~~                                         
   .CERQ_3            5.157    1.857    2.777    0.005
  Coop ~~                                             
   .CERQ_3            0.609    1.724    0.353    0.724
  AchStriv ~~                                         
   .CERQ_3            0.767    1.517    0.505    0.613
  NegUrg ~~                                           
   .CERQ_3           -1.512    1.283   -1.178    0.239
  LackPrem ~~                                         
   .CERQ_3            0.912    1.124    0.811    0.417
  DERSclar ~~                                         
   .CERQ_3            2.413    1.768    1.365    0.172
  PDQ_AS ~~                                           
   .CERQ_3            0.490    0.779    0.629    0.529
  Age ~~                                              
    Neuro           -37.838   16.712   -2.264    0.024
  FFBISelf ~~                                         
    Neuro            70.017    8.139    8.603    0.000
  FFBIDiss ~~                                         
    Neuro            36.958    6.443    5.736    0.000
  Coop ~~                                             
    Neuro           -31.997    6.027   -5.309    0.000
  AchStriv ~~                                         
    Neuro           -35.816    5.930   -6.040    0.000
  NegUrg ~~                                           
    Neuro           -40.890    5.485   -7.455    0.000
  LackPrem ~~                                         
    Neuro           -13.578    3.733   -3.638    0.000
  DERSclar ~~                                         
    Neuro            48.444    7.136    6.789    0.000
  PDQ_AS ~~                                           
    Neuro             8.151    2.413    3.378    0.001

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
   .CERQ_3            8.135    1.362    5.974    0.000
    Neuro            61.367    1.529   40.138    0.000
    Age              35.312    0.738   47.821    0.000
    FFBISelf          8.678    0.293   29.652    0.000
    FFBIDiss          6.228    0.262   23.760    0.000
    Coop             16.805    0.250   67.280    0.000
    AchStriv         15.711    0.241   65.109    0.000
    NegUrg           12.199    0.211   57.879    0.000
    LackPrem         13.395    0.162   82.883    0.000
    DERSclar          9.167    0.280   32.714    0.000
    PDQ_AS            1.116    0.104   10.749    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .CERQ_3           35.283    4.104    8.598    0.000
    Neuro           495.457   48.427   10.231    0.000
    Age             117.228   11.307   10.368    0.000
    FFBISelf         18.385    1.776   10.351    0.000
    FFBIDiss         14.771    1.425   10.368    0.000
    Coop             13.413    1.294   10.368    0.000
    AchStriv         12.339    1.202   10.262    0.000
    NegUrg            9.484    0.921   10.301    0.000
    LackPrem          5.616    0.542   10.368    0.000
    DERSclar         16.884    1.628   10.368    0.000
    PDQ_AS            2.310    0.225   10.269    0.000

R-Square:
                   Estimate
    CERQ_3            0.283

Summary of results

Method     # auxiliary   CERQ_3 ~ Neuro   SE      \(R^2\)   # parameters
Listwise   NA            0.159            0.022   0.264     2
No aux     0             0.157            0.022   0.258     5
Highest    1             0.156            0.022   0.253     9
Boruta     3             0.165            0.021   0.277     20
Red line   9             0.168            0.021   0.283     77
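
The rows of a summary like this can be pulled from each fitted model rather than copied by hand. The following is a minimal sketch, assuming the lavaan package is installed. The helper `summarize_fit()` and the simulated data frame `sim` (with variables `y`, `x`, and `aux`) are hypothetical stand-ins for the tutorial's `dat1`, so the estimates it prints will not match the values reported here.

```r
library(lavaan)

# Hypothetical simulated data standing in for dat1 (NOT the tutorial data)
set.seed(2025)
n   <- 215
x   <- rnorm(n)
aux <- 0.5 * x + rnorm(n)
y   <- 0.5 * x + rnorm(n)
y[aux > 1] <- NA                      # missingness on y depends on aux
sim <- data.frame(y = y, x = x, aux = aux)

# Hypothetical helper: one summary row per fitted model
summarize_fit <- function(fit, label, n_aux) {
  pe    <- parameterEstimates(fit)
  slope <- pe[pe$lhs == "y" & pe$op == "~" & pe$rhs == "x", ]
  data.frame(
    method = label,
    n_aux  = n_aux,
    est    = slope$est,                                    # unstandardized slope
    se     = slope$se,                                     # its standard error
    r2     = as.numeric(lavInspect(fit, "rsquare")["y"]),  # R-square for y
    n_par  = as.numeric(fitMeasures(fit, "npar"))          # free parameters
  )
}

# FIML without and with the auxiliary variable as an extra DV
fit_noaux <- sem("y ~ x",       data = sim, missing = "fiml", fixed.x = FALSE)
fit_aux   <- sem("y + aux ~ x", data = sim, missing = "fiml", fixed.x = FALSE)

results <- rbind(summarize_fit(fit_noaux, "No aux", 0),
                 summarize_fit(fit_aux,   "One aux", 1))
print(results)
```

The parameter counts (5 and 9) depend only on model structure, so they match the "No aux" and "Highest" rows even though the data are simulated.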

Wrap-up

Final thoughts

  • RFA is a useful approach for selecting auxiliary variables related to missingness
  • It can be implemented in R (and other software)
  • Currently no strong guidelines or cutoffs for variable importance
    • See Rothacher & Strobl (2024) for emerging findings

Final thoughts

  • This is real data, so we don’t know the true effect
  • Missing data literature: including more auxiliary variables is better
    • Include as many as you can
    • Use variable importance as a guide to the order in which to add them
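
One way to act on this advice is to generate the extra DVs model string for the top k auxiliary variables, then refit with increasing k. The following is a minimal base R sketch. The helper `build_aux_model()` is hypothetical, and the ranking in `aux_ranked` is illustrative only: the variable names come from the red line model, but only Coop's first-place rank is taken from the tutorial.

```r
# Illustrative ranking, most important first (ASSUMED order, except Coop)
aux_ranked <- c("Coop", "Age", "FFBISelf", "FFBIDiss")

# Hypothetical helper: put the outcome and the top-k auxiliaries on the
# left-hand side so that each one is regressed on the predictor
build_aux_model <- function(outcome, predictor, aux, k) {
  stopifnot(k >= 0, k <= length(aux))
  lhs <- paste(c(outcome, head(aux, k)), collapse = " + ")
  paste(lhs, "~", predictor)
}

# One model string per number of auxiliary variables (0, 1, ..., all)
models <- vapply(seq(0, length(aux_ranked)),
                 function(k) build_aux_model("CERQ_3", "Neuro", aux_ranked, k),
                 character(1))
print(models)
```

Each string can be passed to `sem()` with `missing = "fiml"` and `fixed.x = FALSE`, as in the Step 3 models. With k = 1 the helper reproduces the "Highest importance" model, `"CERQ_3 + Coop ~ Neuro"`.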

Questions?

References

  • Example dataset
    • IPIP-NEO-120: Carter et al. (2016)
    • CERQ: Garnefski & Kraaij (2007)
  • Variable importance: Debeer, Hothorn, & Strobl (2021); Debeer & Strobl (2020); Hapfelmeier & Ulm (2013); Rothacher & Strobl (2024)
  • Saturated correlates: Graham, J. W. (2003). Adding missing-data-relevant variables to FIML-based structural equation models. Structural Equation Modeling, 10, 80-100.

Extra DVs models

Extra DVs model

Figure 3 from Graham (2003)

Step 3c: Highest importance (1)

highest_analysis <- sem("CERQ_3 + Coop ~ Neuro", 
                        data = dat1, 
                        missing = "fiml", 
                        fixed.x = FALSE)
summary(highest_analysis, rsquare = TRUE)
lavaan 0.6-19 ended normally after 27 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         9

  Number of observations                           215
  Number of missing patterns                         4

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Observed
  Observed information based on                Hessian

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  CERQ_3 ~                                            
    Neuro             0.156    0.022    7.204    0.000
  Coop ~                                              
    Neuro            -0.066    0.011   -6.265    0.000

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)
 .CERQ_3 ~~                                           
   .Coop              0.767    1.735    0.442    0.658

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
   .CERQ_3            8.590    1.384    6.206    0.000
   .Coop             20.865    0.688   30.314    0.000
    Neuro            61.600    1.531   40.241    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .CERQ_3           35.009    4.000    8.752    0.000
   .Coop             11.281    1.096   10.296    0.000
    Neuro           490.547   48.015   10.217    0.000

R-Square:
                   Estimate
    CERQ_3            0.253
    Coop              0.159

Step 3d: Boruta (3)

boruta_analysis <- sem("CERQ_3 + Age + EPA_infr + DERSimp ~ Neuro", 
                       data = dat1, 
                       missing = "fiml", 
                       fixed.x = FALSE)
summary(boruta_analysis, rsquare = TRUE)
lavaan 0.6-19 ended normally after 79 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        20

  Number of observations                           215
  Number of missing patterns                         8

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Observed
  Observed information based on                Hessian

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  CERQ_3 ~                                            
    Neuro             0.165    0.021    7.705    0.000
  Age ~                                               
    Neuro            -0.074    0.033   -2.241    0.025
  EPA_infr ~                                          
    Neuro             0.002    0.002    1.384    0.166
  DERSimp ~                                           
    Neuro             0.114    0.012    9.355    0.000

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)
 .CERQ_3 ~~                                           
   .Age             -10.069    5.104   -1.973    0.049
   .EPA_infr          0.344    0.350    0.983    0.326
   .DERSimp           7.062    2.056    3.434    0.001
 .Age ~~                                              
   .EPA_infr         -0.981    0.411   -2.387    0.017
   .DERSimp          -6.880    2.925   -2.352    0.019
 .EPA_infr ~~                                         
   .DERSimp           0.685    0.160    4.288    0.000

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
   .CERQ_3            8.244    1.373    6.003    0.000
   .Age              39.869    2.161   18.449    0.000
   .EPA_infr          0.074    0.112    0.657    0.511
   .DERSimp           2.954    0.801    3.687    0.000
    Neuro            61.672    1.536   40.141    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .CERQ_3           35.295    4.076    8.658    0.000
   .Age             114.524   11.049   10.365    0.000
   .EPA_infr          0.302    0.029   10.240    0.000
   .DERSimp          15.408    1.524   10.113    0.000
    Neuro           495.255   48.790   10.151    0.000

R-Square:
                   Estimate
    CERQ_3            0.277
    Age               0.023
    EPA_infr          0.009
    DERSimp           0.296

Step 3e: Red line (9)

redline_analysis <- sem("CERQ_3 + Age + FFBISelf + FFBIDiss + Coop + AchStriv + NegUrg + LackPrem + DERSclar + PDQ_AS ~ Neuro", 
                        data = dat1, 
                        missing = "fiml", 
                        fixed.x = FALSE)
summary(redline_analysis, rsquare = TRUE)
lavaan 0.6-19 ended normally after 115 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        77

  Number of observations                           215
  Number of missing patterns                        10

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Observed
  Observed information based on                Hessian

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  CERQ_3 ~                                            
    Neuro             0.168    0.021    7.850    0.000
  Age ~                                               
    Neuro            -0.076    0.033   -2.325    0.020
  FFBISelf ~                                          
    Neuro             0.141    0.009   15.653    0.000
  FFBIDiss ~                                          
    Neuro             0.075    0.011    6.937    0.000
  Coop ~                                              
    Neuro            -0.065    0.010   -6.175    0.000
  AchStriv ~                                          
    Neuro            -0.072    0.010   -7.433    0.000
  NegUrg ~                                            
    Neuro            -0.083    0.008  -10.743    0.000
  LackPrem ~                                          
    Neuro            -0.027    0.007   -3.894    0.000
  DERSclar ~                                          
    Neuro             0.098    0.011    9.006    0.000
  PDQ_AS ~                                            
    Neuro             0.016    0.005    3.571    0.000

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)
 .CERQ_3 ~~                                           
   .Age              -9.997    5.079   -1.968    0.049
   .FFBISelf          4.376    1.509    2.900    0.004
   .FFBIDiss          5.157    1.857    2.777    0.005
   .Coop              0.609    1.724    0.353    0.724
   .AchStriv          0.767    1.517    0.505    0.613
   .NegUrg           -1.512    1.283   -1.178    0.239
   .LackPrem          0.912    1.124    0.811    0.417
   .DERSclar          2.413    1.768    1.365    0.172
   .PDQ_AS            0.490    0.779    0.629    0.529
 .Age ~~                                              
   .FFBISelf         -5.074    2.165   -2.343    0.019
   .FFBIDiss         -6.514    2.571   -2.533    0.011
   .Coop              1.031    2.460    0.419    0.675
   .AchStriv         -0.880    2.289   -0.384    0.701
   .NegUrg            2.965    1.834    1.617    0.106
   .LackPrem          1.031    1.672    0.616    0.538
   .DERSclar         -5.776    2.578   -2.241    0.025
   .PDQ_AS           -1.768    1.083   -1.632    0.103
 .FFBISelf ~~                                         
   .FFBIDiss          4.641    0.779    5.957    0.000
   .Coop             -1.438    0.692   -2.079    0.038
   .AchStriv         -1.233    0.633   -1.947    0.052
   .NegUrg           -1.260    0.510   -2.469    0.014
   .LackPrem         -0.115    0.463   -0.248    0.804
   .DERSclar          3.334    0.751    4.441    0.000
   .PDQ_AS            1.567    0.319    4.906    0.000
 .FFBIDiss ~~                                         
   .Coop             -2.020    0.813   -2.484    0.013
   .AchStriv         -2.066    0.758   -2.724    0.006
   .NegUrg           -1.663    0.604   -2.754    0.006
   .LackPrem         -0.198    0.543   -0.365    0.715
   .DERSclar          4.981    0.901    5.526    0.000
   .PDQ_AS            1.499    0.368    4.076    0.000
 .Coop ~~                                             
   .AchStriv          0.268    0.727    0.368    0.713
   .NegUrg            3.163    0.614    5.152    0.000
   .LackPrem          1.824    0.541    3.371    0.001
   .DERSclar         -2.522    0.822   -3.068    0.002
   .PDQ_AS           -2.562    0.384   -6.665    0.000
 .AchStriv ~~                                         
   .NegUrg            0.407    0.535    0.761    0.446
   .LackPrem          1.478    0.503    2.937    0.003
   .DERSclar         -2.648    0.777   -3.408    0.001
   .PDQ_AS           -0.400    0.322   -1.245    0.213
 .NegUrg ~~                                           
   .LackPrem          1.981    0.418    4.742    0.000
   .DERSclar         -1.809    0.614   -2.946    0.003
   .PDQ_AS           -1.551    0.274   -5.663    0.000
 .LackPrem ~~                                         
   .DERSclar         -1.101    0.552   -1.996    0.046
   .PDQ_AS           -0.947    0.240   -3.952    0.000
 .DERSclar ~~                                         
   .PDQ_AS            1.310    0.365    3.592    0.000

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
   .CERQ_3            8.135    1.362    5.974    0.000
   .Age              39.998    2.144   18.658    0.000
   .FFBISelf          0.006    0.591    0.010    0.992
   .FFBIDiss          1.650    0.703    2.349    0.019
   .Coop             20.768    0.682   30.434    0.000
   .AchStriv         20.147    0.636   31.661    0.000
   .NegUrg           17.263    0.503   34.307    0.000
   .LackPrem         15.077    0.459   32.836    0.000
   .DERSclar          3.167    0.709    4.468    0.000
   .PDQ_AS            0.106    0.300    0.354    0.724
    Neuro            61.367    1.529   40.138    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .CERQ_3           35.283    4.104    8.598    0.000
   .Age             114.339   11.030   10.366    0.000
   .FFBISelf          8.491    0.845   10.053    0.000
   .FFBIDiss         12.014    1.167   10.298    0.000
   .Coop             11.347    1.099   10.324    0.000
   .AchStriv          9.750    0.953   10.232    0.000
   .NegUrg            6.109    0.601   10.173    0.000
   .LackPrem          5.244    0.506   10.363    0.000
   .DERSclar         12.147    1.186   10.245    0.000
   .PDQ_AS            2.176    0.212   10.274    0.000
    Neuro           495.457   48.427   10.231    0.000

R-Square:
                   Estimate
    CERQ_3            0.283
    Age               0.025
    FFBISelf          0.538
    FFBIDiss          0.187
    Coop              0.154
    AchStriv          0.210
    NegUrg            0.356
    LackPrem          0.066
    DERSclar          0.281
    PDQ_AS            0.058

Standardized solutions

Step 3a: Listwise deletion

standardizedSolution(listwise)
     lhs op    rhs est.std    se      z pvalue ci.lower ci.upper
1 CERQ_3  ~  Neuro   0.514 0.056  9.154      0    0.404    0.624
2 CERQ_3 ~~ CERQ_3   0.736 0.058 12.745      0    0.623    0.849
3  Neuro ~~  Neuro   1.000 0.000     NA     NA    1.000    1.000

Step 3b: FIML w no auxiliary (0)

standardizedSolution(fimlnoaux)
     lhs op    rhs est.std    se      z pvalue ci.lower ci.upper
1 CERQ_3  ~  Neuro   0.508 0.059  8.619      0    0.392    0.623
2 CERQ_3 ~~ CERQ_3   0.742 0.060 12.401      0    0.625    0.859
3  Neuro ~~  Neuro   1.000 0.000     NA     NA    1.000    1.000
4 CERQ_3 ~1          1.246 0.248  5.017      0    0.759    1.732
5  Neuro ~1          2.777 0.153 18.168      0    2.478    3.077

Step 3c: Highest importance (1)

standardizedSolution(highest_analysis)
     lhs op    rhs est.std    se      z pvalue ci.lower ci.upper
1 CERQ_3  ~  Neuro   0.503 0.059  8.470  0.000    0.387    0.620
2   Coop  ~  Neuro  -0.399 0.059 -6.800  0.000   -0.514   -0.284
3 CERQ_3 ~~ CERQ_3   0.747 0.060 12.477  0.000    0.629    0.864
4   Coop ~~   Coop   0.841 0.047 17.995  0.000    0.749    0.933
5 CERQ_3 ~~   Coop   0.039 0.087  0.443  0.658   -0.132    0.209
6  Neuro ~~  Neuro   1.000 0.000     NA     NA    1.000    1.000
7 CERQ_3 ~1          1.254 0.249  5.031  0.000    0.766    1.743
8   Coop ~1          5.697 0.248 22.995  0.000    5.212    6.183
9  Neuro ~1          2.781 0.153 18.224  0.000    2.482    3.080

Step 3d: Boruta (3)

standardizedSolution(boruta_analysis)
        lhs op      rhs est.std    se      z pvalue ci.lower ci.upper
1    CERQ_3  ~    Neuro   0.526 0.056  9.335  0.000    0.416    0.637
2       Age  ~    Neuro  -0.152 0.067 -2.264  0.024   -0.283   -0.020
3  EPA_infr  ~    Neuro   0.096 0.069  1.391  0.164   -0.039    0.230
4   DERSimp  ~    Neuro   0.544 0.049 11.020  0.000    0.447    0.641
5    CERQ_3 ~~   CERQ_3   0.723 0.059 12.192  0.000    0.607    0.839
6       Age ~~      Age   0.977 0.020 47.926  0.000    0.937    1.017
7  EPA_infr ~~ EPA_infr   0.991 0.013 75.352  0.000    0.965    1.017
8   DERSimp ~~  DERSimp   0.704 0.054 13.102  0.000    0.599    0.809
9    CERQ_3 ~~      Age  -0.158 0.078 -2.042  0.041   -0.310   -0.006
10   CERQ_3 ~~ EPA_infr   0.105 0.106  0.994  0.320   -0.102    0.313
11   CERQ_3 ~~  DERSimp   0.303 0.078  3.880  0.000    0.150    0.456
12      Age ~~ EPA_infr  -0.167 0.067 -2.489  0.013   -0.298   -0.035
13      Age ~~  DERSimp  -0.164 0.067 -2.453  0.014   -0.295   -0.033
14 EPA_infr ~~  DERSimp   0.317 0.064  4.988  0.000    0.193    0.442
15    Neuro ~~    Neuro   1.000 0.000     NA     NA    1.000    1.000
16   CERQ_3 ~1            1.180 0.243  4.854  0.000    0.704    1.656
17      Age ~1            3.682 0.239 15.385  0.000    3.213    4.151
18 EPA_infr ~1            0.134 0.204  0.654  0.513   -0.266    0.534
19  DERSimp ~1            0.631 0.191  3.302  0.001    0.257    1.006
20    Neuro ~1            2.771 0.153 18.147  0.000    2.472    3.071

Step 3e: Red line (9)

standardizedSolution(redline_analysis)
        lhs op      rhs est.std    se       z pvalue ci.lower ci.upper
1    CERQ_3  ~    Neuro   0.532 0.055   9.615  0.000    0.423    0.640
2       Age  ~    Neuro  -0.157 0.067  -2.352  0.019   -0.288   -0.026
3  FFBISelf  ~    Neuro   0.734 0.032  22.718  0.000    0.670    0.797
4  FFBIDiss  ~    Neuro   0.432 0.056   7.650  0.000    0.321    0.543
5      Coop  ~    Neuro  -0.392 0.058  -6.711  0.000   -0.507   -0.278
6  AchStriv  ~    Neuro  -0.458 0.055  -8.374  0.000   -0.565   -0.351
7    NegUrg  ~    Neuro  -0.597 0.045 -13.364  0.000   -0.684   -0.509
8  LackPrem  ~    Neuro  -0.257 0.064  -4.026  0.000   -0.383   -0.132
9  DERSclar  ~    Neuro   0.530 0.050  10.569  0.000    0.431    0.628
10   PDQ_AS  ~    Neuro   0.241 0.065   3.684  0.000    0.113    0.369
11   CERQ_3 ~~   CERQ_3   0.717 0.059  12.187  0.000    0.602    0.832
12      Age ~~      Age   0.975 0.021  46.531  0.000    0.934    1.016
13 FFBISelf ~~ FFBISelf   0.462 0.047   9.747  0.000    0.369    0.555
14 FFBIDiss ~~ FFBIDiss   0.813 0.049  16.669  0.000    0.718    0.909
15     Coop ~~     Coop   0.846 0.046  18.426  0.000    0.756    0.936
16 AchStriv ~~ AchStriv   0.790 0.050  15.767  0.000    0.692    0.888
17   NegUrg ~~   NegUrg   0.644 0.053  12.096  0.000    0.540    0.749
18 LackPrem ~~ LackPrem   0.934 0.033  28.363  0.000    0.869    0.998
19 DERSclar ~~ DERSclar   0.719 0.053  13.552  0.000    0.615    0.824
20   PDQ_AS ~~   PDQ_AS   0.942 0.032  29.892  0.000    0.880    1.004
21   CERQ_3 ~~      Age  -0.157 0.077  -2.038  0.042   -0.309   -0.006
22   CERQ_3 ~~ FFBISelf   0.253 0.080   3.174  0.002    0.097    0.409
23   CERQ_3 ~~ FFBIDiss   0.250 0.083   3.020  0.003    0.088    0.413
24   CERQ_3 ~~     Coop   0.030 0.086   0.353  0.724   -0.138    0.199
25   CERQ_3 ~~ AchStriv   0.041 0.082   0.506  0.613   -0.119    0.202
26   CERQ_3 ~~   NegUrg  -0.103 0.086  -1.197  0.231   -0.272    0.066
27   CERQ_3 ~~ LackPrem   0.067 0.082   0.815  0.415   -0.094    0.228
28   CERQ_3 ~~ DERSclar   0.117 0.084   1.389  0.165   -0.048    0.281
29   CERQ_3 ~~   PDQ_AS   0.056 0.088   0.633  0.527   -0.117    0.229
30      Age ~~ FFBISelf  -0.163 0.067  -2.443  0.015   -0.293   -0.032
31      Age ~~ FFBIDiss  -0.176 0.066  -2.656  0.008   -0.305   -0.046
32      Age ~~     Coop   0.029 0.068   0.420  0.675   -0.105    0.162
33      Age ~~ AchStriv  -0.026 0.069  -0.385  0.700   -0.161    0.108
34      Age ~~   NegUrg   0.112 0.068   1.648  0.099   -0.021    0.246
35      Age ~~ LackPrem   0.042 0.068   0.618  0.537   -0.091    0.176
36      Age ~~ DERSclar  -0.155 0.067  -2.325  0.020   -0.286   -0.024
37      Age ~~   PDQ_AS  -0.112 0.067  -1.663  0.096   -0.244    0.020
38 FFBISelf ~~ FFBIDiss   0.459 0.055   8.383  0.000    0.352    0.567
39 FFBISelf ~~     Coop  -0.147 0.068  -2.156  0.031   -0.280   -0.013
40 FFBISelf ~~ AchStriv  -0.136 0.068  -2.004  0.045   -0.268   -0.003
41 FFBISelf ~~   NegUrg  -0.175 0.067  -2.600  0.009   -0.307   -0.043
42 FFBISelf ~~ LackPrem  -0.017 0.069  -0.249  0.804   -0.153    0.119
43 FFBISelf ~~ DERSclar   0.328 0.062   5.262  0.000    0.206    0.451
44 FFBISelf ~~   PDQ_AS   0.365 0.060   6.042  0.000    0.246    0.483
45 FFBIDiss ~~     Coop  -0.173 0.066  -2.603  0.009   -0.303   -0.043
46 FFBIDiss ~~ AchStriv  -0.191 0.066  -2.882  0.004   -0.321   -0.061
47 FFBIDiss ~~   NegUrg  -0.194 0.066  -2.925  0.003   -0.324   -0.064
48 FFBIDiss ~~ LackPrem  -0.025 0.068  -0.366  0.715   -0.159    0.109
49 FFBIDiss ~~ DERSclar   0.412 0.057   7.234  0.000    0.301    0.524
50 FFBIDiss ~~   PDQ_AS   0.293 0.063   4.661  0.000    0.170    0.416
51     Coop ~~ AchStriv   0.025 0.069   0.368  0.713   -0.110    0.161
52     Coop ~~   NegUrg   0.380 0.059   6.444  0.000    0.264    0.495
53     Coop ~~ LackPrem   0.236 0.064   3.670  0.000    0.110    0.363
54     Coop ~~ DERSclar  -0.215 0.065  -3.293  0.001   -0.343   -0.087
55     Coop ~~   PDQ_AS  -0.516 0.050 -10.227  0.000   -0.614   -0.417
56 AchStriv ~~   NegUrg   0.053 0.069   0.765  0.444   -0.083    0.188
57 AchStriv ~~ LackPrem   0.207 0.066   3.136  0.002    0.078    0.336
58 AchStriv ~~ DERSclar  -0.243 0.065  -3.739  0.000   -0.371   -0.116
59 AchStriv ~~   PDQ_AS  -0.087 0.069  -1.259  0.208   -0.222    0.048
60   NegUrg ~~ LackPrem   0.350 0.061   5.723  0.000    0.230    0.470
61   NegUrg ~~ DERSclar  -0.210 0.066  -3.163  0.002   -0.340   -0.080
62   NegUrg ~~   PDQ_AS  -0.426 0.056  -7.533  0.000   -0.536   -0.315
63 LackPrem ~~ DERSclar  -0.138 0.067  -2.056  0.040   -0.270   -0.006
64 LackPrem ~~   PDQ_AS  -0.280 0.063  -4.457  0.000   -0.404   -0.157
65 DERSclar ~~   PDQ_AS   0.255 0.064   3.972  0.000    0.129    0.381
66    Neuro ~~    Neuro   1.000 0.000      NA     NA    1.000    1.000
67   CERQ_3 ~1            1.160 0.241   4.818  0.000    0.688    1.631
68      Age ~1            3.694 0.238  15.551  0.000    3.229    4.160
69 FFBISelf ~1            0.001 0.138   0.010  0.992   -0.269    0.272
70 FFBIDiss ~1            0.429 0.194   2.211  0.027    0.049    0.810
71     Coop ~1            5.671 0.247  22.920  0.000    5.186    6.155
72 AchStriv ~1            5.736 0.238  24.123  0.000    5.270    6.202
73   NegUrg ~1            5.606 0.212  26.392  0.000    5.190    6.022
74 LackPrem ~1            6.362 0.304  20.924  0.000    5.766    6.958
75 DERSclar ~1            0.771 0.197   3.919  0.000    0.385    1.156
76   PDQ_AS ~1            0.070 0.199   0.352  0.725   -0.319    0.459
77    Neuro ~1            2.757 0.152  18.192  0.000    2.460    3.054

Miscellaneous

cforest() defaults

  • cforest_control() adds more options: BUT BE CAUTIOUS!
    • mtry: Number of predictors randomly sampled as split candidates at each node (default: 5)
      • Recommended: \(\sqrt{\text{total number of variables}}\)
    • ntree: Number of trees (default: 500)
      • Can increase with many predictors
    • mincriterion: Value the test statistic must exceed for a split to be made; determines tree depth (default: qnorm(0.9) ≈ 1.28)
      • Set to 0 to grow the largest possible trees
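
The options above can be set in a single `cforest_control()` call. The following is a minimal sketch, assuming the party package is installed. The simulated data frame `sim`, its predictors `x1` to `x9`, and the missingness indicator `miss` are hypothetical and only serve to make the example runnable. All other control arguments are left at their defaults, which may differ from the settings used in the tutorial's own analysis.

```r
library(party)

# Hypothetical simulated data: 9 predictors and a binary missingness indicator
set.seed(2025)
n <- 200
p <- 9
preds <- as.data.frame(matrix(rnorm(n * p), nrow = n))
names(preds) <- paste0("x", seq_len(p))

# Illustrative mechanism: missingness is more likely at high values of x1
miss <- factor(rbinom(n, 1, plogis(-1 + preds$x1)),
               labels = c("observed", "missing"))
sim <- data.frame(miss = miss, preds)

# Square root of the number of predictors, rounded down
mtry_value <- floor(sqrt(p))

controls <- cforest_control(
  mtry         = mtry_value,  # predictors sampled at each split
  ntree        = 500,         # number of trees (the default)
  mincriterion = 0            # grow the largest possible trees
)

forest <- cforest(miss ~ ., data = sim, controls = controls)

# Permutation importance, largest first
importance <- varimp(forest)
print(sort(importance, decreasing = TRUE))
```

Because importance is estimated by random permutation, the values change somewhat from run to run unless a seed is set.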

stefanycoxe.github.io/SPR2025_Coxe.html